Goals for this session:

  1. Intro and finding data
  2. Importing data and basic manipulations
  3. Proportion
  4. Change over time
  5. Finishing touches

1. Introduction and finding data

The great thing about sports is that one, there is a ton of interest in it, and two, there’s a history of keeping stats. Combine the two, and you get a ton of data about sports online. Indeed, the Times even wrote recently about the people who do it. That’s the good news. The bad news is that the best stuff – the really advanced metrics – is all behind paywalls and hidden in the leagues and teams. Why? Gambling.

But we can get a lot from ESPN, Sports Reference, College Football Stats, etc. And given how little sports data is being analyzed and visualized by journalists, there’s a wide open field.

Let’s look at the data we’ll be working with. These are the game logs for a single basketball team. Just a random team from an outstanding midwestern university. Randomly chosen.

It has basic statistics for each game that team has played and their opponent’s stats. For basketball, this is pretty basic stuff.

But how about we look at every college basketball team and their game logs … all at once.

2. Importing data and basic manipulations

The library we’re going to use today is actually a collection of libraries called the tidyverse. The tidyverse contains the three libraries we’ll use: readr, dplyr and ggplot2. In order, readr will read the data, dplyr will help us query the data and ggplot2 will visualize it.

We load it like this:

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.5.2
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The first thing we’ll do is load our dataset into a data frame. That will give us something to work with. If we have an internet connection, we’ll load it like this. If not, I’ll bring an updated file that we can distribute locally. The gods clearly loved us when they gave us sneakernet.

logs <- read_csv("https://raw.githubusercontent.com/mattwaite/SPMC350-Sports-Data-Analysis-And-Visualization/master/Data/logs.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Date = col_date(format = ""),
##   HomeAway = col_character(),
##   Opponent = col_character(),
##   W_L = col_character(),
##   Blank = col_logical(),
##   Team = col_character(),
##   Conference = col_character()
## )
## See spec(...) for full column specifications.

So what we now have is our log data. To see it, click on the blue dot next to logs in the environment tab in the upper right. That will show us what each field is called. If you double click it, it will pull up a spreadsheet-like view of it.
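
If you would rather stay in the console than click around, a sketch like the following does the same job. The tiny data frame is made up for illustration; only the column names echo the real file, and toy_logs is a hypothetical name.

```r
library(tidyverse)

# Hypothetical stand-in for logs: two invented games, not real results
toy_logs <- tibble(
  Team = c("Team A", "Team B"),
  TeamScore = c(70, 65),
  OpponentScore = c(60, 72)
)

glimpse(toy_logs)  # one line per column: name, type, first few values
names(toy_logs)    # just the field names
```

Swap in logs for toy_logs to look at the real thing.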

For each chart type, we’re going to need to do some basic data analysis. For our first one, let’s calculate a simple metric for how games are going for a team: The score differential. Win big = Good. Lose big = Bad.

We’re going to do a couple of things here. First we’re going to tell R what data we’re working with. Then we’re going to use an operator, %>%, that I find easiest to think of as saying And Now Do This. So we’re going to give it our data, and calculate a new field. We’ll talk it through.

logs %>% mutate(differential = TeamScore-OpponentScore)
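
To see what the operator is doing, here is a minimal sketch on a made-up two-row data frame (the teams and scores are invented, and toy, piped and nested are hypothetical names). The pipe hands whatever is on its left to the function on its right as the first argument, so the piped version and the plain function call give the same result.

```r
library(tidyverse)

# Invented scores for illustration only
toy <- tibble(
  Team = c("Team A", "Team B"),
  TeamScore = c(70, 65),
  OpponentScore = c(60, 72)
)

# "Take toy, And Now Do This: add a differential column"
piped <- toy %>% mutate(differential = TeamScore - OpponentScore)

# The same thing written without the pipe
nested <- mutate(toy, differential = TeamScore - OpponentScore)

piped$differential        # 10 and -7
identical(piped, nested)  # TRUE
```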

If we want to see it sorted, we can add another %>% at the end and then use dplyr’s version of sort, which is arrange.

logs %>% 
  mutate(differential = TeamScore-OpponentScore) %>%
  arrange(desc(differential))

So what this says is that the biggest blowout game in college basketball this season was Lamar beating Howard Payne by 89(!).

Now we need to save this as a new dataframe.

logs %>% 
  mutate(differential = TeamScore-OpponentScore) %>%
  arrange(desc(differential)) -> difflogs

Let’s chart a team’s season. You can do whatever you want, but I know a good one. A tale of woe. A tale of horrors. A tale of my employer. So we use filter and save that as a new dataframe. Saving to a dataframe can be done on the front end or the back end. We did it on the back end last time. Here’s how to do it on the front end.

nu <- difflogs %>% filter(Team == "Nebraska Cornhuskers")
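
A quick sketch, on invented rows, showing that the two directions of assignment end up in the same place. The names toy, front and back are just for illustration.

```r
library(tidyverse)

# Invented rows for illustration
toy <- tibble(
  Team = c("Team A", "Team B", "Team A"),
  TeamScore = c(70, 65, 58)
)

# Back end: the name comes last
toy %>% filter(Team == "Team A") -> back

# Front end: the name comes first
front <- toy %>% filter(Team == "Team A")

identical(front, back)  # TRUE
nrow(front)             # 2
```

Which one you use is a matter of taste. Front-end assignment is the more common style in R code you will find online.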

3. Proportion

The first chart we’ll produce is a simple bar chart. We’ll use our data we just created. The way ggplot works is you create a canvas to paint on, you give your data a geometry – a shape – and then you tell that geometry what data you’re using and what the aesthetic is. In this case, the aesthetic is what fields are being used to create the shape. Themes change the look and feel, and we’ll get into that later. For now, the basics.

ggplot() + geom_bar(data=nu, aes(x=Game, weight=differential))
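
Why weight? Left alone, geom_bar counts rows, so every game would get a bar of height one. The weight aesthetic tells it to add up a column instead. Here is a sketch with invented numbers, which also shows geom_col as another way to draw one bar per row. The names toy, p_bar and p_col are hypothetical.

```r
library(tidyverse)

# Three invented games; not real results
toy <- tibble(Game = c(1, 2, 3), differential = c(12, -5, 20))

# geom_bar sums the weight column for each x value
p_bar <- ggplot() + geom_bar(data = toy, aes(x = Game, weight = differential))

# geom_col uses y directly; with one row per game the bars match
p_col <- ggplot() + geom_col(data = toy, aes(x = Game, y = differential))

# Bar heights ggplot computed for the geom_bar version
layer_data(p_bar)$count  # 12, -5 and 20
```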

If you’re a Nebraska fan … ouch. Things have not gone well lately.

4. Change over time

There are two ways to look at change over time – there’s the game to game change, and then there’s the cumulative change over the course of a season.

Let’s look at how the Big Ten has shot over the course of the season.

logs %>% filter(Conference == "Big Ten") -> big10

So each game has a date and a TeamFGPCT, which is always that team regardless of home or away.

ggplot() + geom_line(data=big10, aes(x=Date, y=TeamFGPCT, group=Team))

Ah, the hairball. How to fix this? Layering. So let’s pick some teams to focus on. Who are the best shooting teams in the Big Ten? This is knowable.

big10 %>% 
  group_by(Team) %>% 
  summarize(
    season_attempts=sum(TeamFGA), 
    season_fg = sum(TeamFG), 
    season_pct = season_fg/season_attempts
    ) %>% 
  arrange(desc(season_pct))
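
One detail worth noticing in that block: the season percentage is total makes divided by total attempts, not the average of each game’s percentage. The two can differ when a team shoots a lot in some games and a little in others. A sketch with made-up box scores (TeamFG and TeamFGA mirror the real column names; the numbers, toy, toy_summary and avg_game_pct are invented for illustration):

```r
library(tidyverse)

# Invented box scores: two games each for two hypothetical teams
toy <- tibble(
  Team    = c("Team A", "Team A", "Team B", "Team B"),
  TeamFG  = c(10, 30, 25, 20),
  TeamFGA = c(40, 60, 50, 50)
)

toy_summary <- toy %>%
  group_by(Team) %>%
  summarize(
    season_attempts = sum(TeamFGA),
    season_fg = sum(TeamFG),
    season_pct = season_fg / season_attempts,  # makes over attempts
    avg_game_pct = mean(TeamFG / TeamFGA)      # average of per-game percentages
  ) %>%
  arrange(desc(season_pct))

toy_summary
```

Team A made 40 of 100 shots on the season, which is .400, but its two games were .250 and .500, which average to .375. Team B took the same number of shots in both games, so its two numbers agree at .450.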

So Michigan State is best, Penn State is worst. Let’s chart how their two seasons have gone.

ms <- big10 %>% filter(Team == "Michigan State Spartans")
ps <- big10 %>% filter(Team == "Penn State Nittany Lions")

Now let’s layer in our data and use colors to bring some to the front and some to the back.

ggplot() + 
  geom_line(data=big10, aes(x=Date, y=TeamFGPCT, group=Team), color="light grey") +
  geom_line(data=ms, aes(x=Date, y=TeamFGPCT, group=Team), color="dark green") +
  geom_line(data=ps, aes(x=Date, y=TeamFGPCT, group=Team), color="dark blue")

Better, but does it tell us much? Eh.

Let’s look at step charts, a form that tracks cumulative change over time.

Nebraska, at one time, was one of the best basketball teams in the Big Ten. And then, in January, it went to hell in a handbasket. You can really see it in the cumulative point differential on the season.

First we need to calculate the game differential, then we need to calculate the cumulative differential for each team. We can do that in one step.

big10cumulative <- big10 %>% 
  mutate(differential = TeamScore-OpponentScore) %>% 
  group_by(Team) %>% 
  mutate(totaldifferential = cumsum(differential))

msc <- big10cumulative %>% filter(Team == "Michigan State Spartans")
nuc <- big10cumulative %>% filter(Team == "Nebraska Cornhuskers")
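
A small sketch of what the cumulative step does, on invented games. Because group_by comes before the second mutate, cumsum restarts for each team. One assumption to flag: cumsum adds rows in the order it finds them, so this sketch sorts by Date first. The real logs may well be in game order already, but the sketch does not rely on it. The names toy and toy_cumulative are hypothetical.

```r
library(tidyverse)

# Invented games, deliberately out of date order
toy <- tibble(
  Team = c("Team A", "Team B", "Team A", "Team B"),
  Date = as.Date(c("2019-01-05", "2019-01-05", "2019-01-02", "2019-01-02")),
  TeamScore = c(80, 60, 70, 75),
  OpponentScore = c(70, 65, 75, 70)
)

toy_cumulative <- toy %>%
  arrange(Date) %>%                                  # earliest game first
  mutate(differential = TeamScore - OpponentScore) %>%
  group_by(Team) %>%                                 # running total per team
  mutate(totaldifferential = cumsum(differential))

toy_cumulative
```

Team A loses by 5 and then wins by 10, so its running total goes -5, then 5. Team B wins by 5 and then loses by 5, so it goes 5, then 0.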

Now we can borrow from our line chart and layer these in. First we’ll do the Big Ten, then Michigan State, then Nebraska. The geometry is now geom_step instead of geom_line, but the aesthetics are pretty much the same.

ggplot() + 
  geom_step(data=big10cumulative, aes(x=Date, y=totaldifferential, group=Team), color="light grey") + 
  geom_step(data=msc, aes(x=Date, y=totaldifferential, group=Team), color="dark green") +
  geom_step(data=nuc, aes(x=Date, y=totaldifferential, group=Team), color="red")

Ouch.

5. Finishing touches

There is an almost limitless number of things we can change with ggplot to make it look different. Every piece of the chart, every element, every font, face and size we can change. The number of choices is overwhelming. So let’s start simple. Let’s just add some text to this mess.

ggplot() + 
  geom_step(data=big10cumulative, aes(x=Date, y=totaldifferential, group=Team), color="light grey") + 
  geom_step(data=msc, aes(x=Date, y=totaldifferential, group=Team), color="dark green") +
  geom_step(data=nuc, aes(x=Date, y=totaldifferential, group=Team), color="red") +
  labs(x="Date", y="Score differential", title="A tale of two seasons", subtitle="Michigan State and Nebraska were neck and neck. And then the swoon happened.", caption="Source: Sports-Reference.com | By Matt Waite")

There’s a designer somewhere having a conniption about this. Let’s create some text hierarchy here.

ggplot() + 
  geom_step(data=big10cumulative, aes(x=Date, y=totaldifferential, group=Team), color="light grey") + 
  geom_step(data=msc, aes(x=Date, y=totaldifferential, group=Team), color="dark green") +
  geom_step(data=nuc, aes(x=Date, y=totaldifferential, group=Team), color="red") +
  labs(x="Date", y="Score differential", title="A tale of two seasons", subtitle="Michigan State and Nebraska were neck and neck. And then the swoon happened.", caption="Source: Sports-Reference.com | By Matt Waite") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size=9)
  )

Better. How about we get rid of some things? The label on the bottom isn’t needed. The axis labels don’t need to be so large. And the tick marks next to the grid lines drive me nuts.

ggplot() + 
  geom_step(data=big10cumulative, aes(x=Date, y=totaldifferential, group=Team), color="light grey") + 
  geom_step(data=msc, aes(x=Date, y=totaldifferential, group=Team), color="dark green") +
  geom_step(data=nuc, aes(x=Date, y=totaldifferential, group=Team), color="red") +
  labs(x="Date", y="Score differential", title="A tale of two seasons", subtitle="Michigan State and Nebraska were neck and neck. And then the swoon happened.", caption="Source: Sports-Reference.com | By Matt Waite") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size=10),
    axis.title.x = element_blank(),
    axis.text = element_text(size = 7),
    axis.ticks = element_blank()
  )
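
Each of those steps repeated the whole chart. One way to cut the repetition, sketched here on invented data, is to save the chart to a variable and keep adding to it with +. The names toy, base and finished are just for illustration, and the grey color is a stand-in.

```r
library(tidyverse)

# Invented running totals for two hypothetical teams
toy <- tibble(
  Team = c("Team A", "Team A", "Team B", "Team B"),
  Date = as.Date(c("2019-01-02", "2019-01-05", "2019-01-02", "2019-01-05")),
  totaldifferential = c(-5, 5, 5, 0)
)

# Build the layers once and keep them in a variable
base <- ggplot() +
  geom_step(data = toy, aes(x = Date, y = totaldifferential, group = Team), color = "grey")

# Add text and theme changes on top of the saved chart
finished <- base +
  labs(y = "Score differential", title = "A tale of two seasons") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title.x = element_blank(),
    axis.ticks = element_blank()
  )

finished  # printing the object draws the chart
```

The saved base chart is untouched by what comes after, so you can try a different theme without rebuilding the layers.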